cmp-lg/9606022 17 Jun 1996 Two Questions about Data-Oriented Parsing

Author

Rens Bod

Abstract

In this paper I present ongoing work on the data-oriented parsing (DOP) model. In previous work, DOP was tested on a cleaned-up set of analyzed part-of-speech strings from the Penn Treebank, achieving excellent test results. This left, however, two important questions unanswered: (1) how does DOP perform if tested on unedited data, and (2) how can DOP be used for parsing word strings that contain unknown words? This paper addresses these questions. We show that parse results on unedited data are worse than on cleaned-up data, although still very competitive compared to other models. As to the parsing of word strings, we show that the hardness of the problem depends not so much on unknown words as on previously unseen lexical categories of known words. We give a novel method for parsing these words by estimating the probabilities of unknown subtrees. The method is of general interest since it shows that good performance can be obtained without the use of a part-of-speech tagger. To the best of our knowledge, our method outperforms other statistical parsers tested on Penn Treebank word strings.

1 Introduction

The Data-Oriented Parsing (DOP) method suggested in Scha (1990) and developed in Bod (1992-1995) is a probabilistic parsing strategy which does not single out a narrowly predefined set of structures as the statistically significant ones. It accomplishes this by maintaining a large corpus of analyses of previously occurring utterances. New input is parsed by combining tree-fragments from the corpus; the frequencies of these fragments are used to estimate which analysis is the most probable one.

In previous work, we tested the DOP method on a cleaned-up set of analyzed part-of-speech strings from the Penn Treebank (Marcus et al., 1993), achieving excellent test results (Bod, 1993a,b). This left, however, two important questions unanswered: (1) how does DOP perform if tested on unedited data, and (2) how can DOP be used for parsing word strings that contain unknown words? This paper addresses these questions.

The rest of the paper is divided into three parts. In section 2 we give a short summary of the DOP method. In section 3 we address the first question: how does DOP perform on unedited data? In section 4 we deal with the question of how DOP can be used for parsing word strings that contain unknown words. This second question turns out to be the actual focus of the article, while the answer …
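The idea that fragment frequencies determine the most probable analysis can be illustrated with a minimal sketch. It assumes the relative-frequency estimate commonly associated with DOP: a fragment's probability is its count divided by the total count of fragments sharing its root label, a derivation's probability is the product of its fragments' probabilities, and a parse's probability is the sum over its derivations. The fragment counts, the bracket notation, and all function names below are hypothetical and are not taken from the paper or the Penn Treebank.

```python
from collections import Counter
from math import prod

# Toy fragment counts keyed by (root label, fragment in bracket notation).
# Invented for illustration only; not real treebank data.
fragment_counts = Counter({
    ("S", "(S NP VP)"): 6,
    ("S", "(S (NP she) VP)"): 2,
    ("NP", "(NP she)"): 3,
    ("NP", "(NP (DT the) NN)"): 1,
    ("VP", "(VP sleeps)"): 4,
})


def root_totals(counts):
    """Sum the counts of all fragments that share a root label."""
    totals = Counter()
    for (root, _fragment), n in counts.items():
        totals[root] += n
    return totals


def fragment_probability(counts, root, fragment):
    """Relative frequency of a fragment among fragments with the same root."""
    total = root_totals(counts)[root]
    return counts[(root, fragment)] / total if total else 0.0


def derivation_probability(counts, derivation):
    """Product of the probabilities of the fragments combined in a derivation."""
    return prod(fragment_probability(counts, r, f) for r, f in derivation)


# Two derivations that build the same tree for "she sleeps" (assumed example).
d1 = [("S", "(S NP VP)"), ("NP", "(NP she)"), ("VP", "(VP sleeps)")]
d2 = [("S", "(S (NP she) VP)"), ("VP", "(VP sleeps)")]

# Under the stated assumption, the parse probability sums over its derivations.
parse_probability = sum(derivation_probability(fragment_counts, d) for d in (d1, d2))
print(parse_probability)
```

Summing over derivations is what distinguishes this kind of model from one that scores a single rule sequence per tree: the same tree can be assembled from fragments of different sizes, and each way of assembling it contributes to its probability.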

Similar resources

cmp-lg/9607010 9 Jul 1996 Efficient Implementation of a Semantic-based Transfer Approach

This article gives an overview of a new semantic-based transfer approach developed and applied within the Verbmobil Machine Translation project [22]. We present the declarative transfer formalism and discuss its implementation. The results presented in this paper have been integrated successfully in the Verbmobil system.

arXiv:cmp-lg/9605018v1 13 May 1996 Efficient Tabular LR Parsing

We give a new treatment of tabular LR parsing, which is an alternative to Tomita's generalized LR algorithm. The advantage is twofold. Firstly, our treatment is conceptually more attractive because it uses simpler concepts, such as grammar transformations and standard tabulation techniques, also known as chart parsing. Secondly, the static and dynamic complexity of parsing, both in space and time...

arXiv:cmp-lg/9604008v1 22 Apr 1996 Efficient Algorithms for Parsing the DOP Model

Excellent results have been reported for Data-Oriented Parsing (DOP) of natural language texts (Bod, 1993c). Unfortunately, existing algorithms are both computationally intensive and difficult to implement. Previous algorithms are expensive due to two factors: the exponential number of rules that must be generated and the use of a Monte Carlo parsing algorithm. In this paper we solve the first p...

arXiv:cmp-lg/9604021v1 29 Apr 1996 Extended Dependency Structures and their Formal Interpretation

We describe two "semantically-oriented" dependency-structure formalisms, U-forms and S-forms. U-forms have been previously used in machine translation as interlingual representations, but without being provided with a formal interpretation. S-forms, which we introduce in this paper, are a scoped version of U-forms, and we define a compositional semantics mechanism for them. Two types of sema...

arXiv:q-alg/9607028v1 27 Jul 1996 Examples of Categorification

Publication date: 1996
